Several models will be built on the historical share values of Stock 1 in order to predict future daily values. More specifically, 3 models of increasing complexity will be constructed, optimized and tested:
SARIMAX
TBATS, which can capture more than one seasonal trend

OUTLINE
Module 1 - LOAD & CLEAN DATA
Module 2 - EXPLORATORY DATA ANALYSIS
| | date | open | high | low | close | volume |
|---|---|---|---|---|---|---|
| 0 | 1997-02-27 | 288.0002 | 288.0002 | 282.0002 | 285.1202 | 194 |
| 1 | 1997-02-28 | 282.0002 | 285.1202 | 282.0002 | 285.1202 | 25 |
| 2 | 1997-03-03 | 288.0002 | 288.0002 | 279.1202 | 279.1202 | 142 |
| 3 | 1997-03-04 | 278.8802 | 278.8802 | 278.8802 | 278.8802 | 0 |
| 4 | 1997-03-05 | 282.0002 | 285.1202 | 282.0002 | 285.1202 | 136 |
One can see that the data keeps track of several variables: date of record (`date`), value at opening (`open`), value at closing (`close`), highest daily value (`high`), lowest daily value (`low`) and daily volume of transactions (`volume`). Intuitively, `open` and `close` should be highly correlated, which will be shown in Module 2 - EDA.
Several steps will be taken to make sure that the data is ready for modelling:
    Feature types after cleaning:
    date      datetime64[ns]
    open             float64
    high             float64
    low              float64
    close            float64
    volume             int64
    dtype: object
    --------------------
    Number of null values for unmodified data:
    open      0
    high      0
    low       0
    close     0
    volume    0
    dtype: int64
    --------------------
    There are a total of 244 missing business days. The missing business days will be imputed in a forward-fill fashion.
    Null values:
    open      244
    high      244
    low       244
    close     244
    volume    244
    dtype: int64
We can observe that the price and volume features are all in numeric format, that `date` is stored as a datetime, and that the null values introduced by the 244 missing business days were successfully imputed.
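The imputation step described above can be sketched as follows. This is a minimal illustration on a hypothetical toy frame (the values and the reduced column set are assumptions, not the actual Stock 1 data), and it only assumes that the cleaning relies on a standard Monday-to-Friday business calendar without holidays.

```python
import pandas as pd

# Hypothetical toy frame standing in for the stock data.
# Wednesday 1997-03-05 is deliberately left out.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["1997-03-03", "1997-03-04", "1997-03-06", "1997-03-07"]
        ),
        "close": [279.12, 278.88, 285.12, 286.00],
        "volume": [142, 0, 136, 90],
    }
).set_index("date")

# Full business-day calendar between the first and last record.
all_bdays = pd.bdate_range(df.index.min(), df.index.max())
missing = all_bdays.difference(df.index)
print(f"There are a total of {len(missing)} missing business days.")

# Reindexing inserts each missing day as an all-NaN row ...
df_full = df.reindex(all_bdays)
print(df_full.isna().sum())

# ... and forward-fill copies the last observed values into those rows.
df_full = df_full.ffill()

# The NaNs turned volume into floats; restore the integer type.
df_full["volume"] = df_full["volume"].astype("int64")
```

Reindexing before filling is also why a null count equal to the number of missing business days shows up between the two steps: the gaps only become visible as NaN rows once the full calendar is imposed.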
The first step is simply to plot the values of `close` and `open` in order to spot any obvious trends.
Below are the main points observed for the data:
Quite a lot of variability initially
The values seem to stabilize around 2018
The `close` and `open` values appear to be quite correlated. Therefore, one could focus exclusively on `close` for modelling.
Only data from 2020 onwards will be considered for downstream analysis.
The correlation plot above confirms that open and close values are virtually perfectly correlated and only one is needed for predicting the daily value of the shares (close).
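As a rough illustration of how such a correlation can be quantified, the sketch below computes a Pearson correlation matrix with pandas. The prices are synthetic (a random walk, with each opening value set to the previous close plus small noise), which is an assumption made purely for demonstration and not a description of the Stock 1 data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic prices (assumption): close follows a random walk,
# open is the previous close plus a small overnight move.
close = 280 + rng.normal(0, 2, 500).cumsum()
open_ = np.roll(close, 1) + rng.normal(0, 0.5, 500)
open_[0] = close[0]  # no previous close exists for the first day

df = pd.DataFrame({"open": open_, "close": close})

# Pairwise Pearson correlation between the columns.
corr = df.corr(method="pearson")
r = corr.loc["open", "close"]
print(corr)
```

When the coefficient is this close to 1, the second series carries almost no additional information, which is the argument for modelling `close` alone.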
Let us quickly look at the distribution of the variables in the dataset.
`volume`, `close` and `open` are right-skewed. These values can be log-transformed to make them more normally distributed, which would be useful if one wished to fit a linear regression; we will abstain from that for now.
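The effect of a log transform on a right-skewed variable can be checked numerically. The sketch below uses synthetic log-normal volumes (an assumption, not the real data) and `np.log1p`, chosen here because the dataset contains days with a volume of 0, for which a plain logarithm is undefined.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed volumes (assumption): log-normal draws, rounded
# to whole numbers of transactions.
volume = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=2000)).round()

skew_raw = volume.skew()

# log(1 + x) stays defined for zero-volume days.
volume_log = np.log1p(volume)
skew_log = volume_log.skew()

print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

A skewness near 0 after the transform indicates a roughly symmetric distribution, whereas a clearly positive value before it reflects the long right tail.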
Next, seasonality is assessed by applying a seasonal decomposition, as plotted below.
The data appears stationary (residuals centred around 0), even though one can observe some heteroscedasticity (RESIDUALS plot).
There is a somewhat oscillating trend in the TREND plot that remains after factoring out the seasonal component.
Finally, there is a clear seasonality as illustrated by the SEASONAL plot.
The model will be optimized so that it captures as much of the trend and seasonal components as possible when making future predictions.
The parameters for fitting the time series model using SARIMAX are discussed in detail in Stocks_1_SARIMA.ipynb, and the associated optimisation code is provided in the Stocks_1_SARIMA.py script. Below, the performance of the optimal model will be assessed on the training data, on the validation data (walk-forward validation) and on new daily data obtained after the model was trained.
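For readers unfamiliar with walk-forward validation, the sketch below shows the general idea under simple assumptions: one-step-ahead forecasts, an expanding history and RMSE as the error metric. The function names and the naive last-value forecaster are hypothetical stand-ins; they are not the code from Stocks_1_SARIMA.py.

```python
import numpy as np


def walk_forward(series, n_train, forecast_one_step):
    """One-step-ahead walk-forward validation.

    At each step the forecaster sees every observation up to that point,
    predicts the next one, and the true value is then added to the history.
    """
    history = list(series[:n_train])
    preds = []
    for actual in series[n_train:]:
        preds.append(forecast_one_step(history))
        history.append(actual)  # reveal the true value before the next step
    preds = np.asarray(preds)
    actuals = np.asarray(series[n_train:])
    rmse = float(np.sqrt(np.mean((preds - actuals) ** 2)))
    return preds, rmse


def naive_last_value(history):
    """Hypothetical baseline: tomorrow's value equals today's."""
    return history[-1]


series = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
preds, rmse = walk_forward(series, n_train=3, forecast_one_step=naive_last_value)
print(preds, rmse)  # predictions 12, 13, 14 against actuals 13, 14, 15 -> RMSE 1.0
```

Passing the forecaster as a callable keeps the validation loop independent of the model, so a function that refits or updates a SARIMAX model on `history` and returns its one-step forecast could be swapped in without changing the loop.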